5 + 5
# Use comments to explain (the why of) your code. print(2 * (18 - 7))
We import, wrangle, visualize, and model data using functions. Functions are composed of arguments that tell the function how to operate.
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. An R package is a collection of functions, documentation, and sometimes data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.4 ✔ readr 2.1.5 ## ✔ forcats 1.0.0 ✔ stringr 1.5.1 ## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
customer_data <- read_csv("customer_data.csv")
## Rows: 10531 Columns: 14 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (8): gender, married, college_degree, region, state, review_time, review... ## dbl (6): customer_id, birth_year, income, credit, review_id, star_rating ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
customer_data now appears in our Environment.<-.customer_data
## # A tibble: 10,531 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1002 1970 Female 31000 749. Yes No West ## 3 1003 1988 Male 35000 542. No No South ## 4 1004 1984 Other 64000 574. Yes Yes Midwest ## 5 1005 1987 Male 58000 644. No Yes West ## 6 1006 1994 Male 164000 554. Yes Yes Midwest ## 7 1007 1968 Male 39000 608. No No Midwest ## 8 1008 1994 Male 69000 710. No No South ## 9 1009 1958 Male 233000 702. No No West ## 10 1010 1994 Female 77000 605. Yes No Northeast ## # ℹ 10,521 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
glimpse(customer_data)
## Rows: 10,531 ## Columns: 14 ## $ customer_id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1… ## $ birth_year <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1… ## $ gender <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M… ## $ income <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000… ## $ credit <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6… ## $ married <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"… ## $ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"… ## $ region <chr> "South", "West", "South", "Midwest", "West", "Midwest",… ## $ state <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "… ## $ review_id <dbl> 933551, NA, NA, NA, 501318, 109125, NA, 1959683, NA, 19… ## $ star_rating <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N… ## $ review_time <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",… ## $ review_title <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al… ## $ review_text <chr> "everything's fine", NA, NA, NA, "I looked all over the…
The heart of data wrangling is transforming data.
{dplyr} provides a consistent grammar for transforming data with functions (a.k.a., verbs) that mirror SQL.
We often want to filter our data to keep certain observations using column values.
filter(customer_data, college_degree == "Yes")
## # A tibble: 7,782 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1004 1984 Other 64000 574. Yes Yes Midwest ## 2 1005 1987 Male 58000 644. No Yes West ## 3 1006 1994 Male 164000 554. Yes Yes Midwest ## 4 1013 1974 Female 197000 680. Yes Yes West ## 5 1015 1995 Male 347000 658. No Yes West ## 6 1016 1984 Other 292000 626. Yes Yes Northeast ## 7 1017 1997 Male 68000 588. No Yes Midwest ## 8 1018 1984 Male 184000 666. No Yes Northeast ## 9 1019 1984 Female 60000 758. No Yes West ## 10 1021 1997 Female 20000 503. No Yes West ## # ℹ 7,772 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
filter(customer_data, region != "West")
## # A tibble: 5,436 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1003 1988 Male 35000 542. No No South ## 3 1004 1984 Other 64000 574. Yes Yes Midwest ## 4 1006 1994 Male 164000 554. Yes Yes Midwest ## 5 1007 1968 Male 39000 608. No No Midwest ## 6 1008 1994 Male 69000 710. No No South ## 7 1010 1994 Female 77000 605. Yes No Northeast ## 8 1012 1953 Female 126000 673. Yes No South ## 9 1014 1997 Male 49000 666. Yes No Northeast ## 10 1016 1984 Other 292000 626. Yes Yes Northeast ## # ℹ 5,426 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
How would we filter by gender == "Female" and income > 70000?
Why are we putting quotes around "Female" but not gender?
filter(customer_data, gender == "Female", income > 70000)
## # A tibble: 3,970 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1010 1994 Female 77000 605. Yes No Northeast ## 3 1012 1953 Female 126000 673. Yes No South ## 4 1013 1974 Female 197000 680. Yes Yes West ## 5 1022 1979 Female 155000 805. No Yes West ## 6 1023 1995 Female 137000 539. No Yes Northeast ## 7 1024 1974 Female 285000 685. Yes Yes Midwest ## 8 1028 1980 Female 87000 715. No No West ## 9 1030 1969 Female 163000 636. Yes Yes West ## 10 1036 1978 Female 227000 614. Yes Yes Northeast ## # ℹ 3,960 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
Sometimes we want to slice our data to keep certain observations using their position in the data.
slice(customer_data, 1:5)
## # A tibble: 5 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1002 1970 Female 31000 749. Yes No West ## 3 1003 1988 Male 35000 542. No No South ## 4 1004 1984 Other 64000 574. Yes Yes Midwest ## 5 1005 1987 Male 58000 644. No Yes West ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
We can arrange observations to reveal helpful information and check data.
arrange(customer_data, birth_year)
## # A tibble: 10,531 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 9585 1939 Female 182000 850 Yes No Northeast ## 2 2409 1947 Male 227000 807. Yes Yes Midwest ## 3 4777 1947 Male 169000 651. Yes Yes Northeast ## 4 5165 1947 Female 258000 850 No Yes Northeast ## 5 8277 1947 Male 92000 839. Yes No West ## 6 8525 1947 Female 170000 723. Yes Yes West ## 7 9437 1947 Female 95000 850 No Yes West ## 8 10069 1947 Other 84000 743. Yes Yes West ## 9 4292 1948 Female 169000 760. Yes No Northeast ## 10 5282 1948 Male 176000 705. Yes No South ## # ℹ 10,521 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
arrange(customer_data, desc(birth_year))
## # A tibble: 10,531 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1026 1999 Male 66000 643. No No South ## 2 1049 1999 Other 88000 630. No Yes West ## 3 1092 1999 Other 77000 664. No Yes Midwest ## 4 1107 1999 Female 97000 579. No Yes West ## 5 1113 1999 Male 190000 661. Yes Yes Northeast ## 6 1126 1999 Female 121000 639. No Yes Midwest ## 7 1132 1999 Male 53000 669. No No West ## 8 1139 1999 Female 293000 659. Yes Yes Northeast ## 9 1143 1999 Female 74000 587. No Yes West ## 10 1147 1999 Other 109000 429. Yes No West ## # ℹ 10,521 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
Sometimes we only care to keep certain variables, especially when working with a large dataset.
select(customer_data, region:review_text)
## # A tibble: 10,531 × 7 ## region state review_id star_rating review_time review_title review_text ## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 South DC 933551 4 06 11, 2015 Four Stars everything… ## 2 West WA NA NA <NA> <NA> <NA> ## 3 South AR NA NA <NA> <NA> <NA> ## 4 Midwest MN NA NA <NA> <NA> <NA> ## 5 West HI 501318 5 03 25, 2008 Great Product!! I looked a… ## 6 Midwest MN 109125 2 06 7, 2013 Not at all fla… I ordered … ## 7 Midwest MN NA NA <NA> <NA> <NA> ## 8 South KY 1959683 5 04 20, 2016 Super comforta… Super comf… ## 9 West NM NA NA <NA> <NA> <NA> ## 10 Northeast VT 1907775 5 10 18, 2015 nice strong en… Yeah ## # ℹ 10,521 more rows
We can also recode existing variables or create new variables.
mutate(customer_data, income = income / 1000)
## # A tibble: 10,531 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73 742. No No South ## 2 1002 1970 Female 31 749. Yes No West ## 3 1003 1988 Male 35 542. No No South ## 4 1004 1984 Other 64 574. Yes Yes Midwest ## 5 1005 1987 Male 58 644. No Yes West ## 6 1006 1994 Male 164 554. Yes Yes Midwest ## 7 1007 1968 Male 39 608. No No Midwest ## 8 1008 1994 Male 69 710. No No South ## 9 1009 1958 Male 233 702. No No West ## 10 1010 1994 Female 77 605. Yes No Northeast ## # ℹ 10,521 more rows ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
Two important things to remember:
_. For example: good_name2 and 2bad name. Bad names should be renamed (but can be referenced if surrounded with " ").A common variable (e.g., an ID) allows us to join two data frames.
store_transactions <- read_csv("store_transactions.csv")
## Rows: 10531 Columns: 169 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## dbl (169): customer_id, jan_2005, feb_2005, mar_2005, apr_2005, may_2005, ju... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(store_transactions)
## Rows: 10,531 ## Columns: 169 ## $ customer_id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010… ## $ jan_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0… ## $ feb_2005 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 0, 4, 0… ## $ mar_2005 <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 5, 0… ## $ apr_2005 <dbl> 4, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0… ## $ may_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jun_2005 <dbl> 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0… ## $ jul_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1… ## $ aug_2005 <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0… ## $ sep_2005 <dbl> 0, 0, 0, 0, 0, 4, 0, 5, 4, 0, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0… ## $ oct_2005 <dbl> 4, 0, 0, 0, 3, 0, 5, 2, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 4… ## $ nov_2005 <dbl> 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 5, 3, 0, 2, 2, 0, 2, 0, 5, 0… ## $ dec_2005 <dbl> 3, 0, 1, 0, 0, 2, 1, 2, 5, 0, 0, 5, 0, 2, 4, 0, 2, 2, 3, 0… ## $ jan_2006 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 4, 2, 3, 0, 0, 4, 4… ## $ feb_2006 <dbl> 2, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5, 0, 4, 0, 0, 0, 0, 0, 0, 0… ## $ mar_2006 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0… ## $ apr_2006 <dbl> 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 5, 3… ## $ may_2006 <dbl> 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jun_2006 <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 4, 0, 5… ## $ jul_2006 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 4… ## $ aug_2006 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ sep_2006 <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0… ## $ oct_2006 <dbl> 0, 2, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ nov_2006 <dbl> 0, 0, 0, 4, 1, 1, 0, 0, 0, 3, 3, 3, 5, 1, 0, 0, 0, 0, 0, 0… ## $ dec_2006 <dbl> 5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 5, 3, 0, 0, 0, 0, 3… ## $ jan_2007 <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 0, 0, 4, 4, 0, 0, 0, 0, 4, 3… ## $ feb_2007 <dbl> 2, 0, 0, 0, 2, 0, 0, 3, 4, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0… ## $ mar_2007 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ apr_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1… ## $ may_2007 <dbl> 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jun_2007 <dbl> 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 3, 1, 5… ## $ jul_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 0, 0, 2, 0, 0, 0, 0, 2… ## $ aug_2007 <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ sep_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 3… ## $ oct_2007 <dbl> 3, 3, 0, 0, 0, 3, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 5, 0, 0… ## $ nov_2007 <dbl> 0, 2, 0, 5, 0, 5, 0, 0, 0, 5, 1, 0, 4, 0, 0, 2, 2, 3, 0, 3… ## $ dec_2007 <dbl> 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 1, 4, 0, 4, 0, 3, 0… ## $ jan_2008 <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 4, 0, 0, 0, 0… ## $ feb_2008 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ mar_2008 <dbl> 5, 5, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ apr_2008 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 1, 1… ## $ may_2008 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0… ## $ jun_2008 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0… ## $ jul_2008 <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 0, 4, 0, 2, 4, 0, 2, 0… ## $ aug_2008 <dbl> 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 5, 0, 2, 0, 4, 0, 3, 0, 1, 0… ## $ sep_2008 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 5, 0, 0, 0, 2… ## $ oct_2008 <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 2, 0, 3, 0, 0, 3, 3, 0, 0, 4… ## $ nov_2008 <dbl> 1, 0, 0, 2, 2, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 2, 4, 0… ## $ dec_2008 <dbl> 0, 0, 0, 0, 2, 3, 1, 0, 0, 0, 2, 2, 0, 4, 4, 0, 1, 0, 5, 1… ## $ jan_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0… ## $ feb_2009 <dbl> 2, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0… ## $ mar_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0… ## $ apr_2009 <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 5… ## $ may_2009 <dbl> 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 5, 4, 2, 0, 2, 1, 0… ## $ jun_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jul_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 3, 0, 1, 5, 3… ## $ aug_2009 <dbl> 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3… ## $ sep_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 0… ## $ oct_2009 <dbl> 5, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0… ## $ nov_2009 <dbl> 5, 0, 0, 0, 0, 1, 0, 2, 2, 0, 4, 2, 0, 3, 4, 0, 4, 1, 2, 4… ## $ dec_2009 <dbl> 0, 0, 0, 3, 3, 0, 0, 4, 1, 4, 2, 2, 0, 0, 3, 0, 0, 0, 5, 1… ## $ jan_2010 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 5… ## $ feb_2010 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0… ## $ mar_2010 <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ apr_2010 <dbl> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 2, 0… ## $ may_2010 <dbl> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0… ## $ jun_2010 <dbl> 4, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 3… ## $ jul_2010 <dbl> 0, 0, 4, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0… ## $ aug_2010 <dbl> 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 4, 3, 0, 0, 0, 0, 0, 0… ## $ sep_2010 <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0, 3… ## $ oct_2010 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 4, 0, 0, 0, 0, 2, 0… ## $ nov_2010 <dbl> 0, 0, 1, 2, 3, 5, 4, 1, 0, 4, 0, 3, 5, 0, 0, 0, 0, 3, 3, 0… ## $ dec_2010 <dbl> 0, 0, 5, 3, 0, 0, 5, 0, 0, 0, 0, 0, 4, 4, 0, 5, 3, 0, 0, 5… ## $ jan_2011 <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 5, 2, 1… ## $ feb_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0… ## $ mar_2011 <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 4… ## $ apr_2011 <dbl> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 0… ## $ may_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4… ## $ jun_2011 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0… ## $ jul_2011 <dbl> 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0… ## $ aug_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 2… ## $ sep_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ oct_2011 <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0… ## $ nov_2011 <dbl> 5, 0, 0, 1, 0, 0, 5, 1, 2, 0, 0, 0, 4, 0, 5, 0, 0, 2, 3, 2… ## $ dec_2011 <dbl> 1, 4, 0, 4, 1, 5, 0, 4, 0, 0, 2, 0, 0, 0, 0, 2, 2, 4, 3, 4… ## $ jan_2012 <dbl> 3, 0, 0, 0, 0, 3, 0, 0, 4, 4, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0… ## $ feb_2012 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 0, 0, 5, 1, 0, 0, 0, 0… ## $ mar_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0… ## $ apr_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ may_2012 <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 2… ## $ jun_2012 <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 3, 3, 0, 0, 0, 0, 3, 0, 5, 0, 2, 0… ## $ jul_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0… ## $ aug_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 0, 0, 0, 0, 1, 0, 0, 4, 2, 0… ## $ sep_2012 <dbl> 0, 5, 1, 4, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1… ## $ oct_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 0, 0, 0, 0… ## $ nov_2012 <dbl> 3, 2, 0, 0, 0, 0, 0, 5, 4, 4, 2, 0, 0, 5, 0, 2, 3, 0, 1, 5… ## $ dec_2012 <dbl> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0, 4, 4, 1, 4, 0, 3, 4, 0… ## $ jan_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 4, 4… ## $ feb_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0… ## $ mar_2013 <dbl> 0, 4, 0, 3, 4, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0… ## $ apr_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0… ## $ may_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jun_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 3, 5, 0, 0, 0… ## $ jul_2013 <dbl> 0, 0, 4, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0… ## $ aug_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ sep_2013 <dbl> 1, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ oct_2013 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0… ## $ nov_2013 <dbl> 0, 0, 0, 0, 4, 0, 2, 1, 4, 2, 3, 0, 0, 3, 0, 0, 2, 0, 3, 3… ## $ dec_2013 <dbl> 0, 4, 4, 0, 4, 2, 0, 0, 4, 0, 0, 3, 2, 3, 3, 2, 0, 3, 2, 2… ## $ jan_2014 <dbl> 3, 0, 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0… ## $ feb_2014 <dbl> 0, 2, 0, 0, 0, 1, 0, 4, 0, 5, 0, 0, 5, 1, 0, 0, 0, 3, 1, 0… ## $ mar_2014 <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2… ## $ apr_2014 <dbl> 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0… ## $ may_2014 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 3, 2, 0, 0, 0, 0, 0, 2… ## $ jun_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 3, 0… ## $ jul_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 4, 0, 0… ## $ aug_2014 <dbl> 2, 3, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ sep_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 5, 0, 1, 4, 0… ## $ oct_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 0, 0, 2… ## $ nov_2014 <dbl> 4, 2, 3, 4, 0, 3, 3, 0, 0, 0, 0, 0, 4, 4, 0, 4, 0, 0, 1, 3… ## $ dec_2014 <dbl> 3, 0, 4, 0, 0, 1, 0, 0, 0, 3, 4, 0, 3, 0, 0, 0, 0, 2, 0, 5… ## $ jan_2015 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ feb_2015 <dbl> 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0… ## $ mar_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0… ## $ apr_2015 <dbl> 0, 0, 0, 0, 3, 2, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ may_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 4… ## $ jun_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0… ## $ jul_2015 <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 4, 5… ## $ aug_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 5… ## $ sep_2015 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 3… ## $ oct_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 5, 0, 0, 0, 0… ## $ nov_2015 <dbl> 3, 0, 0, 0, 1, 4, 0, 3, 4, 3, 0, 0, 1, 4, 0, 0, 0, 0, 4, 4… ## $ dec_2015 <dbl> 0, 0, 0, 1, 0, 0, 0, 5, 0, 4, 0, 0, 0, 3, 3, 0, 0, 0, 3, 3… ## $ jan_2016 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 3… ## $ feb_2016 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 1… ## $ mar_2016 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1, 0, 0, 0, 3, 0… ## $ apr_2016 <dbl> 3, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2… ## $ may_2016 <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3… ## $ jun_2016 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0… ## $ jul_2016 <dbl> 4, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0… ## $ aug_2016 <dbl> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 3, 0… ## $ sep_2016 <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 3, 0… ## $ oct_2016 <dbl> 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0, 0, 4, 0, 0, 4, 1, 0… ## $ nov_2016 <dbl> 5, 0, 0, 4, 2, 3, 3, 0, 0, 3, 0, 0, 0, 4, 0, 2, 0, 2, 0, 0… ## $ dec_2016 <dbl> 2, 1, 0, 0, 0, 0, 1, 5, 0, 0, 0, 1, 3, 0, 4, 2, 2, 0, 2, 0… ## $ jan_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 4, 0… ## $ feb_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0… ## $ mar_2017 <dbl> 0, 1, 0, 0, 0, 1, 0, 2, 3, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4… ## $ apr_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0… ## $ may_2017 <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 2, 0, 0, 2, 0, 3, 3, 0, 4, 3… ## $ jun_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 2, 0, 0… ## $ jul_2017 <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 2… ## $ aug_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0… ## $ sep_2017 <dbl> 2, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0… ## $ oct_2017 <dbl> 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0… ## $ nov_2017 <dbl> 0, 2, 5, 0, 0, 1, 4, 0, 2, 0, 5, 0, 2, 5, 0, 0, 3, 0, 4, 0… ## $ dec_2017 <dbl> 0, 0, 4, 0, 0, 0, 0, 5, 5, 5, 4, 4, 0, 0, 2, 4, 0, 3, 0, 2… ## $ jan_2018 <dbl> 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 3, 1… ## $ feb_2018 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ mar_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 2, 0… ## $ apr_2018 <dbl> 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 4, 0, 0… ## $ may_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ jun_2018 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 0, 0, 3… ## $ jul_2018 <dbl> 0, 0, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 0, 5… ## $ aug_2018 <dbl> 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 5… ## $ sep_2018 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 5, 0, 2, 0, 0, 0, 0, 2, 4, 0… ## $ oct_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ nov_2018 <dbl> 0, 0, 4, 0, 3, 0, 0, 2, 0, 4, 1, 0, 1, 0, 0, 0, 0, 0, 5, 0… ## $ dec_2018 <dbl> 0, 5, 0, 0, 2, 0, 0, 4, 0, 0, 0, 4, 0, 4, 1, 0, 0, 0, 2, 0…
A left join keeps all rows in the “left” data frame plus all columns from both data frames. A left join is a good “default” join to start with.
left_join(customer_data, store_transactions, join_by(customer_id))
## # A tibble: 10,531 × 182 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1002 1970 Female 31000 749. Yes No West ## 3 1003 1988 Male 35000 542. No No South ## 4 1004 1984 Other 64000 574. Yes Yes Midwest ## 5 1005 1987 Male 58000 644. No Yes West ## 6 1006 1994 Male 164000 554. Yes Yes Midwest ## 7 1007 1968 Male 39000 608. No No Midwest ## 8 1008 1994 Male 69000 710. No No South ## 9 1009 1958 Male 233000 702. No No West ## 10 1010 1994 Female 77000 605. Yes No Northeast ## # ℹ 10,521 more rows ## # ℹ 174 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>, jan_2005 <dbl>, ## # feb_2005 <dbl>, mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, ## # jun_2005 <dbl>, jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, ## # oct_2005 <dbl>, nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, ## # feb_2006 <dbl>, mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, …
An inner join keeps rows that have matching IDs along with all columns from both data frames.
inner_join(customer_data, store_transactions, join_by(customer_id))
## # A tibble: 10,531 × 182 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1001 1971 Female 73000 742. No No South ## 2 1002 1970 Female 31000 749. Yes No West ## 3 1003 1988 Male 35000 542. No No South ## 4 1004 1984 Other 64000 574. Yes Yes Midwest ## 5 1005 1987 Male 58000 644. No Yes West ## 6 1006 1994 Male 164000 554. Yes Yes Midwest ## 7 1007 1968 Male 39000 608. No No Midwest ## 8 1008 1994 Male 69000 710. No No South ## 9 1009 1958 Male 233000 702. No No West ## 10 1010 1994 Female 77000 605. Yes No Northeast ## # ℹ 10,521 more rows ## # ℹ 174 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>, jan_2005 <dbl>, ## # feb_2005 <dbl>, mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, ## # jun_2005 <dbl>, jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, ## # oct_2005 <dbl>, nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, ## # feb_2006 <dbl>, mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, …
We typically want to perform many consecutive function calls. You might be tempted to do the following. Don’t do this!
crm_data_2 <- left_join(customer_data, store_transactions, join_by(customer_id)) crm_data_3 <- filter(crm_data_2, region == "West", feb_2005 == max(feb_2005)) crm_data_4 <- mutate(crm_data_3, age = 2022 - birth_year) crm_data_5 <- select(crm_data_4, age, feb_2005) crm_data_6 <- arrange(crm_data_5, desc(age)) crm_data_7 <- slice(crm_data_6, 1) crm_data_7
## # A tibble: 1 × 2 ## age feb_2005 ## <dbl> <dbl> ## 1 65 5
This would be another way run the same code. Don’t do this either!
slice(
arrange(
select(
mutate(
filter(
left_join(
customer_data, store_transactions, join_by(customer_id)),
region == "West", feb_2005 == max(feb_2005)
),
age = 2022 - birth_year),
age, feb_2005),
desc(age)),
1)
## # A tibble: 1 × 2 ## age feb_2005 ## <dbl> <dbl> ## 1 65 5
Part of the common philosophy for the tidyverse is that:
This allows us to |> together functions in consecutive lines of code so that it is easier for humans to read and less error-prone.
filter(customer_data, credit == 850, state == "WY")
## # A tibble: 1 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1335 1978 Female 112000 850 No No West ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
customer_data |> filter(.data = _, credit == 850, state == "WY")
## # A tibble: 1 × 14 ## customer_id birth_year gender income credit married college_degree region ## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 1335 1978 Female 112000 850 No No West ## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>, ## # review_time <chr>, review_title <chr>, review_text <chr>
|> do?|>?|>.Read |> as then. (If we need to use <-, we typically put it at the beginning.)
customer_data |> left_join(store_transactions, join_by(customer_id)) |> filter(region == "West", feb_2005 == max(feb_2005)) |> mutate(age = 2023 - birth_year) |> select(age, feb_2005) |> arrange(desc(age)) |> slice(1)
## # A tibble: 1 × 2 ## age feb_2005 ## <dbl> <dbl> ## 1 66 5
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
In RStudio on Posit Cloud, create a new R script, load the tidyverse, and import the data. Then, using the |> in consecutive lines of code, find the customers who have spent the most recently by doing the following.
customer_data and store_transactions data.South.age using 2023 - birth_year.age, gender, income, credit, married, college_degree, region, and dec_2018.dec_2018 transactions.